Goto

Collaborating Authors

 clusterwise regression


A Generalized Framework for Predictive Clustering and Optimization

arXiv.org Artificial Intelligence

Clustering is a powerful and extensively used data science tool. While clustering is generally thought of as an unsupervised learning technique, there are also supervised variations such as Spath's clusterwise regression that attempt to find clusters of data that yield low regression error on a supervised target. We believe that clusterwise regression is just a single vertex of a largely unexplored design space of supervised clustering models. In this article, we define a generalized optimization framework for predictive clustering that admits different cluster definitions (arbitrary point assignment, closest center, and bounding box) and both regression and classification objectives. We then present a joint optimization strategy that exploits mixed-integer linear programming (MILP) for global optimization in this generalized framework. To alleviate scalability concerns for large datasets, we also provide highly scalable greedy algorithms inspired by the Majorization-Minimization (MM) framework. Finally, we demonstrate the ability of our models to uncover different interpretable discrete cluster structures in data by experimenting with four real-world datasets.


Topology-based Clusterwise Regression for User Segmentation and Demand Forecasting

arXiv.org Machine Learning

Topological Data Analysis (TDA) is a recent approach to analyze data sets from the perspective of their topological structure. Its use for time series data has been limited. In this work, a system developed for a leading provider of cloud computing combining both user segmentation and demand forecasting is presented. It consists of a TDA-based clustering method for time series inspired by a popular managerial framework for customer segmentation and extended to the case of clusterwise regression using matrix factorization methods to forecast demand. Increasing customer loyalty and producing accurate forecasts remain active topics of discussion both for researchers and managers. Using a public and a novel proprietary data set of commercial data, this research shows that the proposed system enables analysts to both cluster their user base and plan demand at a granular level with significantly higher accuracy than a state of the art baseline. This work thus seeks to introduce TDA-based clustering of time series and clusterwise regression with matrix factorization methods as viable tools for the practitioner.